[PyTorch] Fix FlashAttention 2 head_dim > 192 on sm103 and other architectures#2836
pedramr wants to merge 4 commits into NVIDIA:main
Conversation
…itectures
Replace the exact-match compute capability allowlist with a >= sm80 range check, matching flash-attn's own gate: Dao-AILab/flash-attention@bbb21d6. The allowlist ((8,0), (9,0), (10,0), (12,0)) missed sm103 (B300), sm89 (L40S), sm86 (A40), and others where FA2 supports head_dim up to 256. The sm103 case was validated on hardware with head_dim=256; the remaining architectures appear to be supported based on flash-attn's >= sm80 guarantee.
Signed-off-by: Pedram Razavi <pedram.razavi@gmail.com>
Greptile Summary: This PR fixes a bug in the FlashAttention 2 head_dim > 192 gate in get_attention_backend.
Confidence Score: 5/5. Safe to merge — minimal, targeted bug fix with correct logic and no regressions introduced. The change removes a single, clearly erroneous allowlist condition. The new condition is logically equivalent to flash-attn's own gate, given that the earlier < sm80 guard already disables FA2 before this point. The dead-code concern flagged in the previous review thread is fully resolved by removing the branch entirely. No new logic is added, and the log message is updated consistently. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[get_attention_backend called] --> B{device_compute_capability < sm80?}
B -- Yes --> C[use_flash_attention_2 = False]
B -- No --> D{use_flash_attention_2 AND FA2 installed?}
D -- No --> G[Skip FA2 head_dim check]
D -- Yes --> E{head_dim_qk > 256\nOR head_dim_qk % 8 != 0?}
E -- Yes --> F[use_flash_attention_2 = False\nlog debug message]
E -- No --> H[FA2 remains enabled]
C --> I[Continue backend selection]
F --> I
G --> I
H --> I
style C fill:#f88,stroke:#c00
style F fill:#f88,stroke:#c00
style H fill:#8f8,stroke:#090
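The flowchart above can be read as straight-line gating logic. Below is a minimal Python sketch of that flow; the function name and variable names mirror the flowchart labels and are illustrative only, not the actual TransformerEngine source.

```python
# Illustrative sketch of the FA2 gating flow shown in the flowchart.
# Names (gate_flash_attention_2, fa2_installed, ...) are hypothetical.
import logging

logger = logging.getLogger("attention_backend")

def gate_flash_attention_2(use_flash_attention_2, fa2_installed,
                           device_compute_capability, head_dim_qk):
    # Pre-sm80 devices are rejected before the head_dim check is reached.
    if device_compute_capability < (8, 0):
        return False
    # Only apply the head_dim gate when FA2 is requested and installed.
    if use_flash_attention_2 and fa2_installed:
        if head_dim_qk > 256 or head_dim_qk % 8 != 0:
            logger.debug(
                "Disabling FlashAttention 2: head_dim_qk=%d "
                "(requires head_dim <= 256 and a multiple of 8 on sm80+)",
                head_dim_qk,
            )
            return False
    return use_flash_attention_2
```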
Reviews (3): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
/te-ci L0
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci pytorch
for more information, see https://pre-commit.ci
Description
The head_dim > 192 gate for FlashAttention 2 in get_attention_backend used an exact-match compute capability allowlist: (8,0), (9,0), (10,0), (12,0). This excluded sm103 (B300/GB300), sm89 (L40S/RTX 4090), sm86 (A40/RTX 3090), and other valid architectures where flash-attn supports head_dim up to 256.
This PR replaces the allowlist with a >= sm80 range check, matching flash-attn's own gate: Dao-AILab/flash-attention@bbb21d6
The sm103 case was validated on hardware with head_dim=256; the remaining architectures appear to be supported based on flash-attn's >= sm80 guarantee.
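The shape of the change can be illustrated with a small sketch. The condition forms below are reconstructed from the description; the real check lives in get_attention_backend in TransformerEngine's PyTorch attention code, and the variable names here are stand-ins.

```python
# Sketch of the gate change described above (condition shapes only).
device_compute_capability = (8, 9)  # e.g. sm89 (L40S / RTX 4090)

# Old gate: exact-match allowlist. sm89, sm86, sm103, ... fall through and
# FA2 gets disabled for head_dim > 192 even though flash-attn supports
# head_dim up to 256 on those architectures.
old_ok = device_compute_capability in ((8, 0), (9, 0), (10, 0), (12, 0))

# New gate: range check matching flash-attn's own >= sm80 requirement.
new_ok = device_compute_capability >= (8, 0)

assert not old_ok and new_ok  # sm89 is now allowed for head_dim in (192, 256]
```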
Type of change
Changes
- Replace the exact-match compute capability allowlist in the head_dim > 192 gate with a range check, relying on the existing device_compute_capability < (8, 0) guard (i.e., allow head_dim up to 256 on all sm80+ architectures)
- Update the debug log message from sm80/90/100+ to sm80+
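To see whether a given machine benefits from this fix, one can inspect the local GPU's compute capability. A small check, assuming a CUDA build of PyTorch and an available GPU:

```python
# Prints whether the local device falls in the sm80+ range that the new gate
# accepts for head_dim in (192, 256].
import torch

if torch.cuda.is_available():
    cc = torch.cuda.get_device_capability()  # (major, minor), e.g. (10, 3) for sm103
    print(f"sm{cc[0]}{cc[1]}: FA2 head_dim up to 256 allowed = {cc >= (8, 0)}")
else:
    print("No CUDA device available")
```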